DOCUMENT RESUME

ED 109 188                                              TM 004 638

AUTHOR        Petrosko, Joseph M.; Hufano, Linda
TITLE         An Assessment of the Quality of High School
              Mathematics Tests.
PUB DATE      [Apr 75]
NOTE          20p.; Paper presented at the Annual Meeting of the
              National Council on Measurement in Education
              (Washington, D.C., March 31-April 2, 1975)

EDRS PRICE    MF-$0.76 HC-$1.58 PLUS POSTAGE
DESCRIPTORS   Algebra; Comparative Analysis; *Evaluation;
              *Evaluation Criteria; Geometry; *Mathematics;
              *Secondary Education; Senior High Schools;
              *Standardized Tests; *Test Construction; Test
              Reliability; Tests; Test Validity

ABSTRACT
     An assessment was made of the psychometric and
educational quality of all high school level tests of general
mathematics, applied mathematics, algebra and geometry. The study was
part of a large-scale project involving evaluations of all
standardized secondary school tests available in the United States.
Assessments revealed most tests to be low in many types of validity
and reliability. Tests of general mathematics, which included
arithmetic, fared the best across 39 criteria of test quality. Test
developers are not meeting many basic standards of test quality in
constructing mathematics tests. (Author)






AN ASSESSMENT OF THE QUALITY OF HIGH SCHOOL MATHEMATICS TESTS



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.



Joseph M. Petrosko
Center for the Study of Evaluation
UCLA Graduate School of Education

and

Linda Hufano
Garvey School District
Rosemead, California







Paper presented at the annual meeting of the
National Council on Measurement in Education,
Washington, D.C., April, 1975



Between 1972 and 1974, the Center for the Study of Evaluation (CSE)
systematically evaluated the great majority of published tests at the
secondary education level (Grades 7 through 12). Evaluations of more
than 5,000 tests or subtests of batteries were published in a set of three
volumes (Hoepfner et al., 1974). The evaluations provide the user with
39 educational and psychometric quality ratings of secondary-level
standardized tests.

This study concerns a subset of the evaluation ratings - those of
mathematics tests in grades 9 through 12. The objective of the study
was twofold: 1) to compare and contrast the quality of tests in various
areas of mathematics, and 2) to note those aspects of test construction
to which developers could direct their future efforts.

METHOD 

Personnel

All test evaluations were performed by individuals trained in
educational testing. The majority of test evaluators possessed either an MA
or a Ph.D. in education or psychology.

Procedure

A multi-step procedure was followed in the evaluations:

1. Following a canvass of test catalogs and test publishers, all tests
suitable or recommended for secondary students, except clinical and
projective measures, were ordered.

2. For each test, evaluators decided if the instrument would be
evaluated in whole or in parts. A subtest was evaluated if it yielded a separate
score which the publisher or the organization of the test itself clearly
indicated could be interpreted separately. Using this rule, a test was
evaluated: 1) as a whole and for each of the subtests, or 2) only as a
whole, or 3) only for the subtests.

3. Each test and subtest was categorized by grade level according
to the claims or directions of the publisher. In the absence of such
information, test evaluators estimated grade levels according to common
curriculum sequences and item difficulties. Tests were assigned to one
or more of three separate categories: 7-8, 9-10, or 11-12. Those tests
that spanned categories (e.g., some tests were labeled "high school" and
intended for grades 9 through 12) were evaluated for each grade
combination and reported separately at each level.

4. Two raters independently assigned each test or subtest to one
of 298 goal categories - 234 goals subsumed under 64 more general goals.
The goals comprised a set especially constructed by the Evaluation
Technologies Program run by CSE. Drawn from textbooks, curriculum guides, journal
articles, and other publications, the goals constituted a comprehensive
taxonomy of secondary education in terms of student outcomes. The
wide-ranging collection included traditional subject-matter areas (e.g., goals
in English, Mathematics, and Science), Vocational and Career Education,
Personality Characteristics (i.e., goals in the affective domain), and
Physical Education.

5. After decisions were made about evaluation of subtests, about
assignment to grade level, and about categorization into goal area, the
tests were evaluated on 39 criteria of test quality. The 39 criteria
were grouped into four broad areas: Measurement Validity, Examinee
Appropriateness, Administrative Usability, and Normed Technical Excellence
(yielding the acronym MEAN for the evaluation system). These criteria were
applied only to the materials provided by the test publisher or distributor.

For each test or subscale that was evaluated, the reviewer used a
standard rating form. Every test was independently rated according to the
MEAN system by at least two raters, each working without access to the
other's ratings. The final adjudication of test assignment to goal area
and adjudication of the 39 quality ratings were both performed by an
additional rater.

It is important to point out that a standard was applied in considering
supporting information on all tests. Thirteen of the 39 MEAN criteria deal
with empirical aspects of tests, mostly validity and reliability. For these
criteria, two rules were devised. The student samples used in generating
empirical data must: (1) contain some students in at least one of the two
grades for a given evaluation (7-8, 9-10, 11-12), and (2) include
students at, but not more than one grade level above or below, these grades.
Using these rules, a test being evaluated for Grades 9-10 would receive
credit for a validity or reliability criterion if student samples contained
any grade combination that included grade 9 and grade 10, but did not
include any students at grade 7 or below or grade 12 and above.

The practical effect of these rules was to downgrade those tests where
care was not taken in reporting data or in planning validity and
reliability studies. A number of tests had "high school" forms in which a mix of
students from all grade levels of high school were used in test
development. Such data were not credited. For example, the data for the grades
9-10 evaluation did not receive credit because grade 12 is more than one
grade above grade 10. Similarly, the data for grades 11-12 were not
credited since grade 9 is more than one grade below 11.
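The two sampling rules can be expressed as a small check. The following is only an illustrative sketch: the function name `sample_is_credited` and the representation of grades as integers are assumptions made here, not part of the MEAN system as published.

```python
def sample_is_credited(sample_grades, evaluated_grades):
    """Return True if a study sample satisfies both grade rules (illustrative)."""
    sample = set(sample_grades)
    low, high = min(evaluated_grades), max(evaluated_grades)
    # Rule 1: some students are in at least one of the evaluated grades.
    overlaps = any(low <= grade <= high for grade in sample)
    # Rule 2: no student is more than one grade above or below those grades.
    within_one = all(low - 1 <= grade <= high + 1 for grade in sample)
    return overlaps and within_one


# A "high school" sample mixing grades 9 through 12 fails at both levels.
mixed = [9, 10, 11, 12]
print(sample_is_credited(mixed, (9, 10)))    # grade 12 is two grades above 10
print(sample_is_credited(mixed, (11, 12)))   # grade 9 is two grades below 11
print(sample_is_credited([8, 9, 10, 11], (9, 10)))
```

Under these assumptions the first two calls return False and the third returns True, matching the "high school" example in the text.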



The complete set of evaluation ratings, along with the list of goals
and a detailed description of the evaluation procedure, is contained in
Hoepfner et al. (1974). The present study focuses on tests in
mathematics for students in grades 9 through 12 (i.e., contained in the
volumes for grades 9-10 and 11-12). These tests were crosstabulated with
a number of the 39 evaluation criteria.

Four areas of mathematics were selected for study. Their descriptions
follow.

General Mathematics
     Including Arithmetic; Number Concepts, Systems, and Sets;
     Measurement.

Applied Mathematics
     Including Business and Consumer Math; Industrial and
     Vocational Math; Computer Programming; Computer Theory
     and Practice.

Algebra
     Including Algebraic Skills and Concepts; Real and Complex
     Number Systems; Equations and Inequalities; Exponents,
     Radicals, Logs, and Functions; Linear Algebra.

Geometry
     Including Informal Geometry; The Nature of Proof in Math;
     Euclidean Plane Geometry; Coordinate Plane Geometry;
     Solid Geometry.



RESULTS

The ratings of tests on several criteria related to content and
construct validity are shown in Table 1. Two important aspects of content
validity were examined: whether item selection procedures were rigorous
and whether empirical item selection occurred. For approximately 50% of
the tests across all four mathematics categories no information was
offered on how items were selected (evidence was sought on the publisher's
sources of information for test construction - curriculum guides,
textbooks, etc.). Across the categories, about 10 percent or fewer of the tests
contained a report of any empirical procedures for item selection (e.g.,
jury of experts, item analysis, criterion group analysis, etc.). As with
all validity and reliability criteria, it must be remembered that
empirical procedures had to be based on samples of students including, but not
more than one year above or below, the age range for which the test was
evaluated.

In construct validity, tests were examined on four criteria. Few
reported divergent validity information (correlations), factorial validity
information (factor loadings), or experimental uses of a test (employing it
in experiments or evaluations). A fairly large proportion of tests in
General Math and Applied Math were credited with Theoretical Support. To
be so rated, a test had to offer some justification of
its existence. An example of such justification might be a statement
like: "In the past decade greater attention has been directed by educators
to the teaching and learning of set theory as a basis for the understanding
of mathematics."



Table 1

Percentages of Tests Receiving Ratings in
Content and Construct Validity

                                              General      Applied
                                              Mathematics  Mathematics  Algebra      Geometry
                                              (N=322)      (N=26)       (N=122)      (N=52)

Content Validity
  Item Selection
    Detailed description of item selection    7            16           14           [illegible]
    Statement made on item selection          35           0            17           [illegible]
    No information on item selection          47           85           37           53
  Empirical Item Selection
    Yes                                       10           0            8            6
    No                                        90           100          92           96

Construct Validity
  Divergent Validity Information
    Yes                                       2            0            1            0
    No                                        98           100          99           100
  Factorial Validity Information
    Yes                                       2            0            3            0
    No                                        98           100          97           100
  Experimental Use of Test
    Yes                                       1            0            2            0
    No                                        99           100          98           100
  Theoretical Support Given
    Yes                                       70           65           8            8
    No                                        30           35           92           92


Table 2 shows ratings in concurrent and predictive validity. For
both types of validity and across the four areas of mathematics, few
studies of any kind were reported, although a fair number of General
Math tests had concurrent validity correlations above .70. Tests in
Applied Mathematics were little better in predictive validity than the
other math areas, even though the Applied Math area was more clearly
related to immediate post-high-school employment. Presumably the latter
fact would make collection of data on some criterion such as job success
a relatively straightforward process.

For both concurrent and predictive validity, referred to by the
Standards for Educational and Psychological Tests as criterion-related
validities, test evaluators judged the quality of the criterion
itself. If the criterion - a test or a measure of "success" at something
- was patently irrelevant or unrelated to the goal area of the evaluated
test, the test was not credited.


- • . ' ' ' \ ' r ' • ' « 

Table 2

Percentages of Tests Receiving Ratings in
Concurrent and Predictive Validity

                                                General      Applied
                                                Mathematics  Mathematics  Algebra      Geometry
                                                (N=322)      (N=26)       (N=122)      (N=52)

Concurrent Validity
  Studies referred to, r ≥ .70                  15           0            6            4
  Studies referred to, .30 < r < .70            2            0            3            2
  No studies referred to                        83           100          [illegible]  [illegible]

Predictive Validity
  r ≥ .70, relevant criteria, interval of
  ≥ 1 month, cross-validation shrinkage ≤ 10%   1            0            [illegible]  2
  r ≥ .70, relevant criteria, interval of
  ≥ 1 month                                     3            [illegible]  [illegible]  2
  .30 < r < .70 or questionable criteria        5            8            3            2
  No study performed or irrelevant study        92           88           [illegible]  94




Table 3 shows how tests fared in reported correlations of test-retest,
internal consistency, and alternate-form reliability. For well over 70% of the
tests, correlations fell below .70 or were not reported. A fair percentage
(19%) of General Mathematics tests had high internal consistency coefficients.
For test-retest reliability, tests were credited if the time span between
testings was one month or more. Retesting with the same form or delayed
alternate form testing were both acceptable. Regarding the criterion of internal
consistency, split-half, Kuder-Richardson, or alpha coefficients were all
accepted as evidence. For alternate form reliability, either immediate or
delayed testing was credited.
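As an illustration of why internal consistency evidence is the least costly to supply, the sketch below computes coefficient alpha from a single administration using the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The function name `cronbach_alpha` and the small data set are hypothetical and are not taken from any test reviewed in the study.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Coefficient alpha from one test administration.

    item_scores: one list per examinee, holding that examinee's score on each item.
    """
    n_items = len(item_scores[0])
    # Sample variance of each item across examinees.
    item_vars = [variance([row[j] for row in item_scores]) for j in range(n_items)]
    # Sample variance of examinees' total scores.
    total_var = variance([sum(row) for row in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: five examinees, four items scored right (1) or wrong (0).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(scores), 2))  # approximately 0.8
```

For right/wrong items such as these, alpha is the same kind of single-administration coefficient as the Kuder-Richardson formulas named above; a value near .80 would fall in the second band of Table 3.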

' ; ^ 

Table 3

Percentages of Tests Receiving Ratings in
Three Types of Reliability

                              General      Applied
                              Mathematics  Mathematics  Algebra      Geometry
                              (N=322)      (N=26)       (N=122)      (N=52)

Test-Retest Coefficient
  r ≥ .90                     1            0            0            0
  .80 ≤ r < .90               4            0            0            0
  .70 ≤ r < .80               2            0            0            0
  r < .70                     93           100          100          100

Internal Consistency Coefficient
  r ≥ .90                     19           0            5            0
  .80 ≤ r < .90               8            4            10           [illegible]
  .70 ≤ r < .80               0            11           1            0
  r < .70                     73           85           84           89

Alternate Form Coefficient
  r ≥ .90                     [illegible]  [illegible]  0            0
  .80 ≤ r < .90               5            8            1            0
  .70 ≤ r < .80               2            [illegible]  0            [illegible]
  r < .70                     90           89           98           100




Up to this point, the ratings were related to purely technical qualities
of the test. However, many of the criteria in the MEAN test evaluation system
pertain to broader issues such as (1) test interpretation, (2) quality of score
distribution, and (3) utility of a test for decision making.

1. Table 4 contains five criteria related to test interpretation. First,
on the positive side, most tests showed their capability of being interpreted
by the school staff rather than by a specialist. Further, score conversion
was usually simple (one step from raw score to scaled score) and 50 percent or
more of the tests had commonly used converted scores, such as percentile ranks
or grade equivalents. Less positive were the findings shown in Norm Range.
Most tests were restricted in range, that is, the upper and lower limits of
the norm group were less than two years beyond the levels for which the test
was evaluated. For example, most tests evaluated for grades 9-10 had norm
groups that did not contain 8th graders or 12th graders. Also, norm groups
were rarely nationally representative, and failed to achieve geographical
representation or to use random sampling procedures. (See Table 4.)

2. Several other criteria on norming procedures and scores are worthy
of a separate table. As can be seen in Table 5, about two-thirds of
mathematics tests had replicable testing procedures. In other words, procedures
of administration, scoring, and interpretation were sufficiently standardized
so that results could be duplicated or replicated from the norm group. Quality
of score distribution and of score graduation varied among the areas of
mathematics. About 75 percent of the Algebra and Geometry tests had badly skewed
distributions (or no information available at all) and had rather crude
converted scores such as quartiles. Tests in General Math and Applied Math tended to
have better score distributions and more graduated standard scores. (See
Table 5.)



Table 4

Percentages of Tests Receiving Ratings on
Criteria Related to Test Interpretation

                                              General      Applied
                                              Mathematics  Mathematics  Algebra      Geometry
                                              (N=322)      (N=26)       (N=122)      (N=52)

Norm Range
  At least 2 years                            15           4            1            0
  Restricted range                            85           96           99           100

Score Interpretation
  Common and simple converted scores (a)      62           81           51           52
  Novel, ambiguous, or no converted scores    38           [illegible]  49           48

Score Conversion
  Simple or no conversion                     77           81           82           83
  Poor tables or 2-step conversion            20           [illegible]  17           17
  Complicated conversion                      2            0            1            0

Norm Group
  Nationally representative (b)               8            0            2            2
  Not nationally representative               92           100          98           98

Score Interpreter
  School staff                                98           96           100          100
  Specialist                                  2            4            [illegible]  [illegible]

(a) Common and simple were pass/fail, percentile ranks, mental ages,
deviation IQ's, and grade equivalents.

(b) Nationally representative meant having at least four of the following
attributes: (1) cluster, stratified, or random sampling; (2) norming
less than five years old; (3) all areas of U.S. sampled; (4) appropriate
age range represented and exhausted; (5) racial/ethnic representation
or separate norms for such groups; (6) urban, suburban, and rural
sampling.
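The "at least four of six attributes" rule for national representativeness can be sketched as a simple count. The attribute labels below paraphrase the footnote; the function name `is_nationally_representative` and the example norm groups are hypothetical.

```python
# Attribute labels paraphrased from the Table 4 footnote.
ATTRIBUTES = (
    "cluster, stratified, or random sampling",
    "norming less than five years old",
    "all areas of U.S. sampled",
    "appropriate age range represented and exhausted",
    "racial/ethnic representation or separate norms",
    "urban, suburban, and rural sampling",
)


def is_nationally_representative(attributes_met, minimum=4):
    """True when a norm group shows at least `minimum` of the six attributes."""
    recognized = set(attributes_met) & set(ATTRIBUTES)
    return len(recognized) >= minimum


# Hypothetical norm groups meeting three and four attributes respectively.
print(is_nationally_representative(ATTRIBUTES[:3]))
print(is_nationally_representative(ATTRIBUTES[:4]))
```

Under this reading, the first norm group would be rated "not nationally representative" and the second "nationally representative".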




Table 5

Percentage of Tests Receiving Ratings on Criteria of
Replicability of Standardization Procedures, Range of Coverage,
and Quality of Score Graduation

                                              General      Applied
                                              Mathematics  Mathematics  Algebra      Geometry
                                              (N=322)      (N=26)       (N=122)      (N=52)

Can the testing procedure be duplicated?
Are procedures of administration, scoring,
and interpretation standardized?
  Yes                                         67           73           76           58
  No                                          33           27           24           [illegible]

Does the test have an adequate range of
coverage? (High ceiling, low floor,
symmetrical distribution)
  Tails of distribution drawn out, floor or
  ceiling not reached                         [illegible]  23           15           13
  One tail of distribution drawn out, floor
  or ceiling not reached                      4            15           [illegible]  2
  Floor or ceiling reached                    17           12           5            2
  No information on score distribution or
  badly skewed                                52           50           78           83

Quality of Score Graduation
  Percentiles, grade equivalents, or mental
  ages                                        40           42           22           23
  Deciles, stanines, T-scores, or Z-scores    16           8            6            2
  Pass-fail, quartiles, or novel scales       44           50           72           75




3. A final criterion, one well worth looking at since it impinges on
the reality of the school world in such a direct way, is the decision-making
utility of a test. How well does a test "map" the range of scores into the
domain of decisions about the educational fate of a student? Table 6 shows
that few tests give prescriptive decision-making information (e.g., "a score
of 30 or more means that the student will very likely succeed if channeled
into introductory algebra"). Few tests in Applied Math, an area presumably
involving skills useful in post-high school vocations, yielded any
information for decisions.

Table 6

Percentage of Tests Receiving Ratings on
Criterion of Decision-Making Utility

                                              General      Applied
                                              Mathematics  Mathematics  Algebra      Geometry
                                              (N=322)      (N=26)       (N=122)      (N=52)

Does the test provide information useful for
making any individual or group decisions?
  Definite, prescriptive decisions            2            [illegible]  0            [illegible]
  Suggestive decisions                        27           0            19           23
  Poor guidelines for decisions               20           8            35           47
  Little or no information for decisions      51           92           [illegible]  29




DISCUSSION

Where mathematics tests fail

This survey of high-school mathematics tests revealed many tests to
be deficient in basic aspects of test quality. The deficiencies extended
across four major curriculum areas and many criteria for judging a test.
A prime example concerns the criterion of content validity - a sine qua non
of achievement tests. The present study revealed that fewer than 20
percent of secondary math tests gave a detailed description of item-selection
procedures. Remarkably few made even a general statement on item selection
(e.g., "current textbooks were surveyed").

This is far from the requirement expressed in Standards for Educational
and Psychological Tests (1974), where it is deemed essential that such
information be provided: "If test performance is to be interpreted as a
representative sample of performance in a universe of situations, the test
manual should give a clear definition of the universe represented and
describe the procedures followed in the sampling from it." (p. 45)

It is not altogether cynical to consider the poor results in light
of the unchanging economics of test development. There is no way to
escape the realization that ratings tended to be higher for those criteria
where it was relatively cheap and easy to provide the information. This
"real" fact holds true across all types of tests. Since most types of
validity studies require the expense of administering other tests or the
collecting of data on some criterion, the work was simply not done. For
reliability (note Table 3), ratings were best for internal consistency
reliability - a coefficient that requires only one test administration.
The inference is inescapable.



Comparisons of tests across the four areas of mathematics may seem
complicated by the widely divergent numbers in the categories. There were
more General Mathematics tests than the other three areas combined.
However, it is important to consider that virtual populations of tests are
being examined, not samples from populations. Any comparisons of General
Mathematics, Applied Mathematics, Algebra, and Geometry can be assumed to
pertain to the entire populations of these types of standardized secondary
tests. In that sense, all percentage differences among the groups are
"significant differences", although not necessarily practically significant.
Practical significance depends on the value assumptions of the reader.

Applying the arbitrary standard of 10 percentage points being
practically significant, one can make a few general statements about tests in the
various areas. In comparison with Applied Mathematics, Algebra, or
Geometry, a larger percentage of General Mathematics tests had high concurrent
validity, high internal consistency, and a norm range covering at least
two years.
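The 10-percentage-point standard amounts to a simple comparison of each area against General Mathematics. The sketch below applies it to two rows as read from the tables (norm range from Table 4, theoretical support from Table 1); the function name `practical_differences` and the data layout are assumptions made for illustration.

```python
AREAS = ("General Mathematics", "Applied Mathematics", "Algebra", "Geometry")

# Percentages in the order of AREAS, as read from Tables 4 and 1.
RATINGS = {
    "Norm range of at least 2 years": (15, 4, 1, 0),
    "Theoretical support given": (70, 65, 8, 8),
}


def practical_differences(percentages, reference=0, threshold=10):
    """Return {area: gap} for areas differing from the reference area
    by at least `threshold` percentage points."""
    base = percentages[reference]
    return {
        AREAS[i]: base - pct
        for i, pct in enumerate(percentages)
        if i != reference and abs(base - pct) >= threshold
    }


for criterion, row in RATINGS.items():
    print(criterion, practical_differences(row))
```

With these figures, General Mathematics exceeds all three other areas on norm range, while on theoretical support it differs from Algebra and Geometry but not from Applied Mathematics (a gap of only 5 points).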

Seventy percent of the General Mathematics and only eight percent of
the Algebra and Geometry tests were rated as having "Theoretical Support,"
but this result must be interpreted carefully. The MEAN criterion of
Theoretical Support had the most saliency for tests in the affective area,
where a theoretical construct was inherent in the goal statement (e.g., self
concept, emotional security). With achievement tests the criterion
reflected a concern that some kind of statement justified the test's
existence (and not necessarily with evidence supporting the statement). Many
General Mathematics tests were, in effect, arithmetic tests. There are
many such instruments on the market and most teachers can easily write
arithmetic items, so publishers may have felt more necessity for providing
a rationale for such tests.

A greater number of Algebra and Geometry tests had poor range of
coverage (inadequate floor and ceiling, etc.) and
crudely graduated scores, such as quartiles. Both phenomena were
undoubtedly caused by factors such as small sample sizes in norm groups and
lack of rigor in item-revision procedures.

How tests can be improved 

The reader who expects a startlingly innovative declaration of how
math tests can be improved will be disappointed. If test developers
carefully applied the existing technology of test construction, there would be
a great improvement in instrumentation. If one had to prescribe where the
efforts of test developers could be directed, the general answer would be
to conduct more validity and reliability studies.

The perennially obvious requirement of content validity does not seem
to be taken seriously by many publishers. Too few developers carefully
define the skills they are purporting to measure and then sample items
from a universe of such skills. Furthermore, a greater number of tests
should have nationally representative norm samples, better score
distributions, and more discriminative types of standard scores. Quartile scales
based on all-white suburban samples simply do not do the job for many test
purchasers. And finally, tests should relate to the real world. For
example, every test in Applied Mathematics should have some type of predictive
validity information, preferably in terms of job performance.



The state of testing in an educational area does not exist in a vacuum.
It both affects and is affected by the state of the curriculum. So not only
would a clearer conception of mathematics lead to better tests, but better
tests may well lead to clearer conceptions of mathematics.

Mathematics, like many other parts of the school curriculum, began
undergoing close examination about 15 years ago. New curricula were developed,
unfortunately not always with the firmest empirical basis. Much of the
problem lay with inability to measure the various skill areas in mathematics.
For example, Romberg (1969) noted: "It is safe to generalize that in most
mathematics studies conducted during the 1960's, researchers used
inappropriate or inadequate measuring devices to assess mathematics achievement."
(p. 482)

- ' ' . ' - *' 

When new^urricula have been compared with traditional approaches^ the 

efforts have been hobbled by weaj<;nesses in tests and testing programs. 

This point is well brought out by Walker and Schaffarzick (197^) in a re- 

^view of re'search studies where old and* new curricula w^re compared; "The 

most important shortconxing of conventional achievement tests and the most 

* • . • i . 

serious single limitation of comparative^ curricular^studies-done so far 
is the restricted range of outcomes measured." (p.ip6) Conventional tests 
tend to measure conventional outcomes, and then without the degree of valid- 
ity and reliability th^t would farm the bes.t evidence fo*- decision making 
abo.ut those being tested. 



16 18 



The pioneer efforts of the National Long i tud i na I ^ Stud<^ of Mathematical 
Abi litres (NLSMA) ar^e* cited by Walker and Schaf f arz-i ck (197^) > Dess^art and 
Frandsen (1973) and others as a positive example of what can be done in 
curriculum and test construction. Test items were carefully linked with 
t^e content areas of mathematics whioh in turn were I'nked with fou'* main 
elements of achievement; Computation, Comprehension Appjication, and ^ 
Analysis, The state of mathematics testing can only be improved if other 
researchers will ntake similar effortSo We need tests that are relevant • 
to the needs of educators and possess the technical quality necessary fo' 
sound research. . 



References 



American Psychological Association. Standards for Educational and
     Psychological Tests. Washington, D.C.: APA, 1974.

Dessart, D.J., & Frandsen, H. Research on teaching secondary school
     mathematics. In R.M.W. Travers (Ed.), Second Handbook of Research
     on Teaching. Chicago: Rand McNally, 1973.

Hoepfner, R., Conniff Jr., W.A., et al. CSE Secondary School Test
     Evaluations. Los Angeles: Center for the Study of Evaluation,
     University of California, 1974. 3 vols.

Romberg, T.A. Current research in mathematics education. Review of
     Educational Research, 1969, 39, 473-491.

Walker, D.F., & Schaffarzick, J. Comparing curricula. Review of
     Educational Research, 1974, 44, 83-111.



